Abstract. What role should assumptions play in inference? We present
a small theoretical case study of a simple, clean case, namely the non- 
parametric comparison of two continuous distributions using (essen- 
tially) information about quartiles, that is, the central information dis- 
played in a pair of boxplots. In particular, we contrast a suggestion 
of John Tukey — that the validity of inferences should not depend on 
assumptions, but assumptions have a role in efficiency — with a com- 
peting suggestion that is an aspect of Hansen's generalized method 
of moments — that methods should achieve maximum asymptotic effi- 
ciency with fewer assumptions. In our case study, the practical perfor- 
mance of these two suggestions is strikingly different. An aspect of this 
comparison concerns the unification or separation of the tasks of esti- 
mation assuming a model and testing the fit of that model. We also look 
at a method (MERT) that aims not at best performance, but rather at 
achieving reasonable performance across a set of plausible models. 
1. INTRODUCTION: A QUESTION AND
AN EXAMPLE 

1.1 What Role for Assumptions? 

In his essay, "Sunset Salvo," Tukey (1986, page 72) 
advocated: 

Reducing dependence on assumptions . . . 
using assumptions as leading cases, not 
truths, . . . when possible, using random- 
ization to ensure validity — leaving to as- 
sumptions the task of helping with strin- 
gency. 

Although the comment is not formal, presumably 
"validity" refers to the level of tests and the cover- 
age rate of confidence intervals, while "stringency" 
refers to efficiency at least against some alterna- 
tives. Later in the essay (page 73), Tukey describes 
a statistic as "safe" if it is "valid — and of reason- 
ably high efficiency — in each of a variety of situa- 
tions." [Recall that a most stringent test minimizes 
the maximum power loss and that in many prob- 






D. S. SMALL, J. L. GASTWIRTH, A. M. KRIEGER AND P. R. ROSENBAUM 



lems, uniformly most powerful tests are not avail- 
able; see Lehmann (1997).] The first part of Tukey's 
suggestion — "reduce dependence of validity on as- 
sumptions" — is today uncontroversial and there are 
many widely varied attempts to achieve that goal. 
The second part of Tukey's suggestion — "use as- 
sumptions to help with stringency" — runs against 
the grain of some recent developments which at- 
tempt to reduce the role of assumptions in obtaining 
efficient procedures. Do assumptions have a role in 
efficiency when comparing equally valid procedures? 
Or can we have it all, asymptotically of course, adapt- 
ing our procedures to the data at hand to increase 
efficiency? Our purpose here is to closely examine 
these questions in a theoretical case study of a sim- 
ple, clean case. We offer exactly the same informa- 
tion to two types of nonparametric procedures in a 
setting in which both are valid, though one chooses 
a procedure with high relative efficiency across a 
set of plausible models, while the other tries to be 
asymptotically efficient with fewer assumptions. The 
first method uses a form of rank statistic (Gastwirth, 
1966, 1985; Birnbaum and Laska, 1967). The second 
method is a particular case of Hansen's (1982) gen- 
eralized method of moments (GMM), widely used 



in econometrics. Both methods compare two distri- 
butions to estimate a shift. Both methods look at 
exactly the same information, somewhat related to 
the information about quartiles depicted in a pair 
of boxplots, but the methods use this simple infor- 
mation very differently. We compare the methods 
in a scientific example, in a simulation and using 
asymptotics. Also, we ask whether there is infor- 
mation against the shift model. We also show how 
to eliminate a shared assumption of both methods, 
namely the existence of a shift, which if false may 
invalidate their conclusions. 

In Section 1.2 a motivating example is described. 
In Section 2 notation and the methods of estimat- 
ing a shift are defined, and in Section 2.4 they are 
applied to the motivating example. The methods 
are evaluated by simulation in finite samples in Sec- 
tion 3, where some large sample results hold in quite 
small samples and others require astonishingly large 
samples. In Section 4 we dispense with the shift 
model. The relevant large sample theory is discussed 
in the Appendix with some patches needed to cover 
some nonstandard details. 
Fig. 1. Genetic damage in radiation exposed and control groups.

1.2 A Motivating Example: Radiation in Homes

In the early 1980s, a number of residential buildings were constructed in Taiwan using $^{60}$Co-contaminated steel rods, with the consequence that the levels of radioactive exposure in these homes were often orders of magnitude higher than background levels. Chang et al. (1999) compared 16 residents of these buildings to 7 unexposed controls with respect to several measures of genetic damage, including the number of centromere-positive signals per 1000 binucleated cells, as depicted in Figure 1. The sorted values for the 16 residents were 3.7, 6.8, 8.4, 8.5, 10.0, 11.3, 12.0, 12.5, 18.7, 19.0, 20.0, 22.7, 24.0, 31.8, 33.3, 36.0 and for the 7 controls were 3.2, 5.1, 8.3, 8.8, 9.5, 11.9, 14.0. They reported means, standard deviations and the significance level from the Mann-Whitney-Wilcoxon test. In Figure 1 the distribution for exposed subjects looks higher, more dispersed and possibly slightly skewed right in comparison to the controls, but of course the sample sizes are small and boxplots fluctuate in appearance by chance. Should one estimate a shift in the distributions? If so, how? If not, what should one do instead?

Suppose that the control measurements, $X_1, \ldots, X_m$, are independent observations from a continuous, strictly increasing cumulative distribution $F(\cdot)$ and that the exposed measurements, $Y_1, \ldots, Y_n$, are independent observations from a continuous, strictly increasing cumulative distribution $G(\cdot)$, with $N = n + m$. The distributions are shifted if there is some constant $\Delta$ such that $Y_j - \Delta$ and $X_i$ have the same distribution $F(\cdot)$, or equivalently if $F(x) = G(x + \Delta)$ for each $x$. We are interested in whether a shift model is compatible with the data, and if it is, in estimating $\Delta$, and if it is not, in estimating something else. Obviously, in the end, with finite amounts of data, there is going to be some uncertainty about both questions (whether a shift model is appropriate, and what values of $\Delta$ are reasonable), but the goal is to describe the available information and the uncertainty that remains.

2. TWO APPROACHES USING THE SAME 
INFORMATION 

2.1 A 4 × 2 Table

Boxplots serve two purposes: they call attention to unique, extreme observations requiring individual attention, and they describe distributional shape in terms of quartiles. Here we focus on quartiles, and contrast two approaches to using (more or less) the depicted information to determine whether a shift, $\Delta$, exists and what values of $\Delta$ are reasonable. We very much want to offer the two approaches exactly the same information, and then see which approach makes better use of this information; that is, we want to avoid complicating the comparison by offering different information to the different approaches. Because the question about distributions is raised by the appearance of boxplots, the information offered to the methods is information about the quartiles.

Consider the null hypothesis that the distributions are shifted by a specified amount, $\Delta_0$, that is, $H_0: F(x) = G(x + \Delta_0)$. If this hypothesis were true, then $Z_1^{\Delta_0} = X_1, \ldots, Z_m^{\Delta_0} = X_m$, $Z_{m+1}^{\Delta_0} = Y_1 - \Delta_0, \ldots, Z_N^{\Delta_0} = Y_n - \Delta_0$ would be $N$ independent observations from $F(\cdot)$. Let $Z_{(1)}^{\Delta_0} < \cdots < Z_{(N)}^{\Delta_0}$ be the order statistics, let $q_i = \lceil iN/4 \rceil$ for $i = 1, 2, 3$, where $\lceil w \rceil$ is the least integer greater than or equal to $w$, and define $Z_{(q_i)}^{\Delta_0}$ to be the $i$th quartile. With $N = 23$ in the example, $q_1 = 6$, $q_2 = 12$, $q_3 = 18$ and the quartiles are $Z_{(6)}^{\Delta_0}$, $Z_{(12)}^{\Delta_0}$, $Z_{(18)}^{\Delta_0}$. Write $k_1 = q_1$, $k_2 = q_2 - q_1$, $k_3 = q_3 - q_2$ and $k_4 = N - q_3$, so for $N = 23$, $k_1 = 6$, $k_2 = 6$, $k_3 = 6$, $k_4 = 5$. For the hypothesized $\Delta_0$, form the 4 × 2 contingency table in Table 1, which classifies the $Z_i^{\Delta_0}$ by quartile and by treatment or control. Notice that the marginal totals of Table 1 are functions of the sample sizes, $n$, $m$ and $N = n + m$, so they are fixed, not varying from sample to sample. If the null hypothesis is true, Table 1 has the multivariate hypergeometric distribution

(1)  $\Pr(A_1^{\Delta_0} = a_1, A_2^{\Delta_0} = a_2, A_3^{\Delta_0} = a_3, A_4^{\Delta_0} = a_4) = \dfrac{\binom{k_1}{a_1}\binom{k_2}{a_2}\binom{k_3}{a_3}\binom{k_4}{a_4}}{\binom{N}{n}}$

for $0 \le a_j \le k_j$, $j = 1, 2, 3, 4$, $a_1 + a_2 + a_3 + a_4 = n$.
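Table 1
Contingency table from pooled quartiles

Quartile interval   Treated             Control                   Total
Lowest              $A_1^{\Delta_0}$    $k_1 - A_1^{\Delta_0}$    $k_1$
Low                 $A_2^{\Delta_0}$    $k_2 - A_2^{\Delta_0}$    $k_2$
High                $A_3^{\Delta_0}$    $k_3 - A_3^{\Delta_0}$    $k_3$
Highest             $A_4^{\Delta_0}$    $k_4 - A_4^{\Delta_0}$    $k_4$
Total               $n$                 $m$                       $N$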
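The construction of Table 1 can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function name `quartile_table` is hypothetical, and the quartile ranks are assumed to be $q_i = \lceil iN/4 \rceil$, which reproduces $q = (6, 12, 18)$ for $N = 23$. The data are the values listed in Section 1.2.

```python
import math

# Data from Section 1.2: exposed residents (Y) and unexposed controls (X).
Y = [3.7, 6.8, 8.4, 8.5, 10.0, 11.3, 12.0, 12.5,
     18.7, 19.0, 20.0, 22.7, 24.0, 31.8, 33.3, 36.0]
X = [3.2, 5.1, 8.3, 8.8, 9.5, 11.9, 14.0]


def quartile_table(x, y, delta0):
    """Return (treated counts A_1..A_4, row totals k_1..k_4) at shift delta0."""
    N = len(x) + len(y)
    # Assumed quartile ranks q_i = ceil(i * N / 4).
    q = [math.ceil(i * N / 4) for i in (1, 2, 3)]
    k = [q[0], q[1] - q[0], q[2] - q[1], N - q[2]]
    # Pool controls (flag 0) with shifted exposed values (flag 1) and sort.
    pooled = sorted([(v, 0) for v in x] + [(v - delta0, 1) for v in y])
    treated, start = [], 0
    for size in k:
        block = pooled[start:start + size]
        treated.append(sum(flag for _, flag in block))
        start += size
    return treated, k


treated, k = quartile_table(X, Y, 8.69)
for name, a, kj in zip(["Lowest", "Low", "High", "Highest"], treated, k):
    print(f"{name:8s} treated={a} control={kj - a} total={kj}")
```

Evaluating the function over a range of `delta0` values gives the collection of 4 × 2 tables that both methods of inference use.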
As a consequence, if the null hypothesis is true, the expected counts are

(2)  $E(A_j^{\Delta_0}) = \dfrac{n k_j}{N}, \quad j = 1, 2, 3, 4,$

with variances and covariances

(3)  $\mathrm{var}(A_j^{\Delta_0}) = \dfrac{n m k_j (N - k_j)}{N^2 (N - 1)}, \qquad \mathrm{cov}(A_i^{\Delta_0}, A_j^{\Delta_0}) = -\dfrac{n m k_i k_j}{N^2 (N - 1)}, \quad i \ne j.$

Write

$\mathbf{A}_{\Delta_0} = (A_1^{\Delta_0}, A_2^{\Delta_0}, A_3^{\Delta_0}, A_4^{\Delta_0})^T, \qquad \mathbf{E} = \left( \dfrac{n k_1}{N}, \dfrac{n k_2}{N}, \dfrac{n k_3}{N}, \dfrac{n k_4}{N} \right)^T,$

and $\mathbf{V}$ for the symmetric 4 × 4 covariance matrix of the hypergeometric defined by (3). Notice, in particular, that $\mathbf{A}_{\Delta_0}$ is a random vector whose value changes with $\Delta_0$, whereas $\mathbf{E}$ and $\mathbf{V}$ are fixed matrices whose values do not change with $\Delta_0$. For each value of $\Delta_0$ there is a table of the form Table 1, and if the distributions were actually shifted, $F(x) = G(x + \Delta)$, then the table with $\Delta_0 = \Delta$ has the multivariate hypergeometric distribution. The information available to both methods of inference is this collection of 4 × 2 tables as $\Delta_0$ varies.
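As a numerical check of (2) and (3), the following sketch builds $\mathbf{E}$ and $\mathbf{V}$ for the example ($n = 16$, $m = 7$, $k = (6, 6, 6, 5)$) using exact fractions. The function name `null_moments` is hypothetical, and the scores $(0, 1, 2, 3)$ are used only to illustrate a linear combination of the counts.

```python
from fractions import Fraction


def null_moments(n, m, k):
    """Expectation vector E from (2) and covariance matrix V from (3)."""
    N = n + m
    E = [Fraction(n * kj, N) for kj in k]
    c = Fraction(n * m, N * N * (N - 1))
    V = [[c * k[i] * (N - k[i]) if i == j else -c * k[i] * k[j]
          for j in range(4)] for i in range(4)]
    return E, V


n, m, k = 16, 7, [6, 6, 6, 5]
E, V = null_moments(n, m, k)

# Mean and variance of a linear combination w'A under the null hypothesis.
w = [0, 1, 2, 3]
mean_T = sum(wi * Ei for wi, Ei in zip(w, E))
var_T = sum(w[i] * V[i][j] * w[j] for i in range(4) for j in range(4))
print(float(mean_T), float(var_T))
```

Each row of `V` sums to zero because the four counts always add to $n$, which is the singularity of $\mathbf{V}$ in concrete form.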
A minor technical issue needs to be mentioned. Because $A_1^{\Delta_0} + A_2^{\Delta_0} + A_3^{\Delta_0} + A_4^{\Delta_0} = n$ is constant, the covariance matrix $\mathbf{V}$ is singular, that is, positive semidefinite but not positive definite. One could avoid this by focusing on $(A_2^{\Delta_0}, A_3^{\Delta_0}, A_4^{\Delta_0})$, but the 4 × 2 table is too familiar to discard for this minor technicality. We define $\mathbf{V}^{-}$ as the specific generalized inverse of $\mathbf{V}$ which has 0's in its first row and column, and has in its bottom right 3 × 3 corner the inverse of the bottom right 3 × 3 corner of $\mathbf{V}$; see Rao (1973, page 27). Although our notation always refers to the 4 × 2 table, the calculations ultimately use only the nondegenerate piece of the table, $(A_2^{\Delta_0}, A_3^{\Delta_0}, A_4^{\Delta_0})$. In later sections, this issue comes up several times, always with minor consequences.

The distribution (1) for Table 1 also arises in other ways. For instance, if there are $N$ subjects and $n$ are randomly assigned to treatment, with the remaining $m = N - n$ subjects assigned to control, and if the treatment has an additive effect $\Delta_0$, then (1) is the distribution for Table 1 without assuming samples from a population.

2.2 Inference Based on Group Ranks 

A simple approach to inference about $\Delta$ assuming shifted distributions uses a group rank statistic, as discussed by Gastwirth (1966) and Markowski and Hettmansperger (1982, Section 5); see also Brown (1981) and Rosenbaum (1999). Here, the rows of Table 1 are assigned scores, $\mathbf{w} = (w_1, w_2, w_3, w_4)^T$, with $w_1 = 0$, and the hypothesis $H_0: \Delta = \Delta_0$ is tested using the group rank statistic $T_{\Delta_0} = \mathbf{w}^T \mathbf{A}_{\Delta_0}$ whose exact null distribution is determined using (1). An exact confidence interval for $\Delta$ is obtained by inverting the test; for example, see Lehmann (1963), Moses (1965) and Bauer (1972). The Hodges-Lehmann (1963) point estimate, $\hat{\Delta}_{hl}$, for this rank test is essentially the solution to the estimating equation

(4)  $\mathbf{w}^T \mathbf{A}_{\hat{\Delta}} = \mathbf{w}^T \mathbf{E}.$

More precisely, with increasing scores, $0 = w_1 \le w_2 \le w_3 \le w_4$, because $\mathbf{w}^T \mathbf{A}_{\Delta}$ moves in discrete steps as $\Delta$ varies continuously, the Hodges-Lehmann estimate is defined so that: (i) if equality in (4) cannot be achieved, then the estimate is the unique point $\Delta$ where $\mathbf{w}^T \mathbf{A}_{\Delta}$ passes $\mathbf{w}^T \mathbf{E}$, or (ii) if equality is achieved for an interval of values of $\Delta$ then $\hat{\Delta}_{hl}$ is defined to be the midpoint of the interval. In large samples, the null distribution of $T_{\Delta_0}$ is approximately Normal with expectation $\mathbf{w}^T \mathbf{E}$ and variance $\mathbf{w}^T \mathbf{V} \mathbf{w}$, so the deviate

(5)  $D_{\Delta_0} = \dfrac{\mathbf{w}^T (\mathbf{A}_{\Delta_0} - \mathbf{E})}{\sqrt{\mathbf{w}^T \mathbf{V} \mathbf{w}}}$

is compared with the standard Normal distribution. The Hodges-Lehmann estimate $\hat{\Delta}_{hl}$ defined by (4) is essentially the same as the value $\Delta$ that minimizes $D_{\Delta}^2$.
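The following sketch computes the group rank statistic $T_{\Delta_0}$ and locates, on a grid, the point where it passes its null expectation, which is case (i) of the definition above. It is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the quartile ranks are assumed to be $q_i = \lceil iN/4 \rceil$, the grid limits and step are arbitrary, and case (ii) (equality on an interval) is not handled. The data are those of Section 1.2.

```python
import math

Y = [3.7, 6.8, 8.4, 8.5, 10.0, 11.3, 12.0, 12.5,
     18.7, 19.0, 20.0, 22.7, 24.0, 31.8, 33.3, 36.0]
X = [3.2, 5.1, 8.3, 8.8, 9.5, 11.9, 14.0]
W = [0, 1, 2, 3]  # scores for the four quartile intervals


def group_rank_stat(x, y, delta0, w):
    """Return (T at delta0, null expectation w'E)."""
    N = len(x) + len(y)
    q = [math.ceil(i * N / 4) for i in (1, 2, 3)]  # assumed quartile ranks
    k = [q[0], q[1] - q[0], q[2] - q[1], N - q[2]]
    pooled = sorted([(v, 0) for v in x] + [(v - delta0, 1) for v in y])
    t, start = 0, 0
    for wj, size in zip(w, k):
        t += wj * sum(flag for _, flag in pooled[start:start + size])
        start += size
    null_mean = sum(wj * len(y) * kj / N for wj, kj in zip(w, k))
    return t, null_mean


def hodges_lehmann_grid(x, y, w, lo=-10.0, hi=40.0, step=0.01):
    """First grid point at which T has dropped below w'E (case (i) only)."""
    steps = int(round((hi - lo) / step))
    for i in range(steps + 1):
        d = lo + i * step
        t, mean = group_rank_stat(x, y, d, w)
        if t < mean:
            return d
    return None


print(group_rank_stat(X, Y, 8.69, W))
print(round(hodges_lehmann_grid(X, Y, W), 2))
```

Because $T_{\Delta_0}$ is a nonincreasing step function of $\Delta_0$ when the scores are nondecreasing, a bisection search would serve equally well; the grid is used here only for transparency.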

All of this assumes the distributions are indeed shifted. A simple test of the hypothesis that $F(x) = G(x + \Delta_0)$ is based on the statistic

$G_{\Delta_0}^2 = (\mathbf{A}_{\Delta_0} - \mathbf{E})^T \mathbf{V}^{-} (\mathbf{A}_{\Delta_0} - \mathbf{E}),$

whose exact null distribution follows from (1) and whose large sample null distribution is approximately $\chi^2$ on three degrees of freedom. It is useful to notice that under the null hypothesis $E(G_{\Delta}^2) = \mathrm{tr}[E\{(\mathbf{A}_{\Delta} - \mathbf{E})(\mathbf{A}_{\Delta} - \mathbf{E})^T\} \mathbf{V}^{-}] = 3$, so the exact null distribution of $G_{\Delta}^2$ and the $\chi^2$ approximation have the same expectation. This will be relevant to certain comparisons to be made later. Is the shift model plausible for plausible values of $\Delta_0$? A simple, informative procedure is to plot the exact, two-sided $P$-values from $D_{\Delta_0}^2$ and $G_{\Delta_0}^2$ against $\Delta_0$. Curiosity is, of course, aroused by values $\Delta_0$ which are accepted by $D_{\Delta_0}^2$ and rejected by $G_{\Delta_0}^2$, because then the shift model is implausible for an ostensibly plausible shift. Greevy et al. (2004) do something similar.
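A direct computation of $G_{\Delta_0}^2$ from a vector of treated counts can be sketched as follows, using the generalized inverse described in Section 2.1 (zeros in the first row and column, and the inverse of the lower right 3 × 3 block of $\mathbf{V}$). The helper names `covariance`, `invert` and `g_squared` are hypothetical, and the small Gauss-Jordan routine is included only to keep the sketch free of external libraries.

```python
def covariance(n, m, k):
    """Covariance matrix V of the treated counts, from (3)."""
    N = n + m
    c = n * m / (N * N * (N - 1))
    return [[c * k[i] * (N - k[i]) if i == j else -c * k[i] * k[j]
             for j in range(4)] for i in range(4)]


def invert(mat):
    """Gauss-Jordan inverse of a small nonsingular matrix."""
    size = len(mat)
    aug = [row[:] + [float(i == j) for j in range(size)]
           for i, row in enumerate(mat)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(size):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[size:] for row in aug]


def g_squared(a, n, m, k):
    """Quadratic form (A - E)' V^- (A - E) for treated counts a."""
    N = n + m
    V = covariance(n, m, k)
    # Only the lower-right 3 x 3 block enters; the first coordinate is dropped.
    inner = invert([row[1:] for row in V[1:]])
    d = [a[j] - n * k[j] / N for j in range(1, 4)]
    return sum(d[i] * inner[i][j] * d[j] for i in range(3) for j in range(3))


print(round(g_squared([6, 2, 3, 5], 16, 7, [6, 6, 6, 5]), 3))
```

The counts `[6, 2, 3, 5]` are the treated column of the table for $\Delta_0 = 8.69$ in the example of Section 2.4.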

2.3 Inference Based on the Generalized Method 
of Moments 

In an important and influential paper, Hansen (1982) proposed a method for combining a number of estimating equations to estimate a smaller number of parameters. In the current context, the Tables 1 for different $\Delta_0$ yield the four estimating equations given by (2), one of which is redundant or linearly dependent on the other three. Obviously, there may be no $\Delta_0$ that satisfies all four equations at once, so Hansen proposed weighting the equations in an optimal way. In particular, he showed that the optimal weighting of the equations uses the inverse covariance matrix of the moment conditions (2), and the resulting estimate is, in fact, the value $\hat{\Delta}_{gmm}$ that minimizes $G_{\Delta_0}^2$:

(6)  $\hat{\Delta}_{gmm} = \arg\min_{\Delta} (\mathbf{A}_{\Delta} - \mathbf{E})^T \mathbf{V}^{-} (\mathbf{A}_{\Delta} - \mathbf{E}).$

In theory, $\hat{\Delta}_{gmm}$ is asymptotically efficient, fully utilizing all of the information in the estimating equations (2); see Hansen (1982) and the Appendix. Surveys of GMM are given by Matyas (1999) and Lindsay and Qu (2003). Hansen's results do not quite apply here, because certain differentiability assumptions he makes are not strictly satisfied, but his conclusions hold nonetheless, as discussed in the Appendix where a result of Jureckova (1969) about the asymptotic linearity of rank statistics replaces differentiability. Hansen proposed a large sample test of the model or "identifying restrictions" (2) using the minimum value of $G_{\Delta_0}^2$, that is, here, testing the family of models by comparing

$G_{\hat{\Delta}_{gmm}}^2 = (\mathbf{A}_{\hat{\Delta}_{gmm}} - \mathbf{E})^T \mathbf{V}^{-} (\mathbf{A}_{\hat{\Delta}_{gmm}} - \mathbf{E})$

to the $\chi^2$ distribution on two degrees of freedom. There does not appear to be an exact null distribution for $G_{\hat{\Delta}_{gmm}}^2$ because it is computed not at $\Delta$ but rather at $\hat{\Delta}_{gmm}$, which varies from sample to sample, so the hypergeometric distribution for $\mathbf{A}_{\Delta}$ is not relevant. One large sample test of $H_0: \Delta = \Delta_0$ compares $G_{\Delta_0}^2 - G_{\hat{\Delta}_{gmm}}^2$ to the chi-square distribution on one degree of freedom (Newey and West, 1987; Matyas, 1999, page 109), and a confidence set is the set of $\Delta_0$ not rejected by the test. See the Appendix.
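A crude way to approximate the minimizer in (6) is a grid search over $\Delta$, taking the midpoint of the grid points at which $G_{\Delta}^2$ attains its minimum. The sketch below is illustrative only. It assumes quartile ranks $q_i = \lceil iN/4 \rceil$, arbitrary grid limits and step, and hypothetical function names. To stay short it evaluates the quadratic form through the closed form $N(N-1)/(nm) \sum_j (A_j - nk_j/N)^2 / k_j$, which is algebraically equal to $(\mathbf{A}_{\Delta} - \mathbf{E})^T \mathbf{V}^{-} (\mathbf{A}_{\Delta} - \mathbf{E})$ for the covariance in (3) because the deviations sum to zero.

```python
import math

Y = [3.7, 6.8, 8.4, 8.5, 10.0, 11.3, 12.0, 12.5,
     18.7, 19.0, 20.0, 22.7, 24.0, 31.8, 33.3, 36.0]
X = [3.2, 5.1, 8.3, 8.8, 9.5, 11.9, 14.0]


def g_squared_at(x, y, delta0):
    """G^2 at shift delta0, via the closed form for the covariance in (3)."""
    n, m = len(y), len(x)
    N = n + m
    q = [math.ceil(i * N / 4) for i in (1, 2, 3)]  # assumed quartile ranks
    k = [q[0], q[1] - q[0], q[2] - q[1], N - q[2]]
    pooled = sorted([(v, 0) for v in x] + [(v - delta0, 1) for v in y])
    total, start = 0.0, 0
    for kj in k:
        a = sum(flag for _, flag in pooled[start:start + kj])
        total += (a - n * kj / N) ** 2 / kj
        start += kj
    return N * (N - 1) / (n * m) * total


def gmm_grid(x, y, lo=-5.0, hi=25.0, step=0.01):
    """Midpoint of the grid minimizers of G^2, and the minimum value."""
    grid = [lo + i * step for i in range(int(round((hi - lo) / step)) + 1)]
    values = [g_squared_at(x, y, d) for d in grid]
    best = min(values)
    argmins = [d for d, v in zip(grid, values) if v - best < 1e-9]
    return (argmins[0] + argmins[-1]) / 2, best


delta_gmm, g_min = gmm_grid(X, Y)
print(round(delta_gmm, 2), round(g_min, 3))
```

Since $G_{\Delta}^2$ is a step function of $\Delta$, a grid search can only locate the minimizing interval up to the grid spacing; an exact search would evaluate the function between consecutive differences $Y_j - X_i$.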



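By a familiar fact (Rao, 1973, page 60)

(7)  $\sup_{\mathbf{c} = (0, c_2, c_3, c_4)^T} \dfrac{\{\mathbf{c}^T (\mathbf{A}_{\Delta} - \mathbf{E})\}^2}{\mathbf{c}^T \mathbf{V} \mathbf{c}} = (\mathbf{A}_{\Delta} - \mathbf{E})^T \mathbf{V}^{-} (\mathbf{A}_{\Delta} - \mathbf{E}) = G_{\Delta}^2,$

so that

(8)  $\hat{\Delta}_{gmm} = \arg\min_{\Delta} \sup_{\mathbf{c} = (0, c_2, c_3, c_4)^T} \dfrac{\{\mathbf{c}^T (\mathbf{A}_{\Delta} - \mathbf{E})\}^2}{\mathbf{c}^T \mathbf{V} \mathbf{c}},$

whereas $\hat{\Delta}_{hl}$ minimizes an analogous quantity, namely $D_{\Delta}^2$ in (5) for one specific set of weights $\mathbf{w}$. The $\mathbf{w}$ that achieves the bound in (7) is $\mathbf{w} = \mathbf{V}^{-} (\mathbf{A}_{\Delta} - \mathbf{E})$, so this optimizing $\mathbf{w}$ is not fixed, but is rather a function of the data.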
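For readers who want the step behind (7), one standard route is the Cauchy-Schwarz inequality. The sketch below introduces its own notation, which is not in the original: $\mathbf{b} = (c_2, c_3, c_4)^T$, $\mathbf{d}$ for $\mathbf{A}_{\Delta} - \mathbf{E}$ with its first coordinate dropped, and $\mathbf{V}_{22}$ for the lower right 3 × 3 block of $\mathbf{V}$, assumed positive definite.

```latex
% Notation assumed for this sketch only:
%   b = (c_2, c_3, c_4)^T,  d = last three coordinates of (A_\Delta - E),
%   V_{22} = lower-right 3 x 3 block of V (assumed positive definite).
\begin{align*}
\{\mathbf{c}^T(\mathbf{A}_{\Delta} - \mathbf{E})\}^2
  &= (\mathbf{b}^T \mathbf{d})^2
   = \{(\mathbf{V}_{22}^{1/2}\mathbf{b})^T (\mathbf{V}_{22}^{-1/2}\mathbf{d})\}^2 \\
  &\le (\mathbf{b}^T \mathbf{V}_{22}\, \mathbf{b})\,(\mathbf{d}^T \mathbf{V}_{22}^{-1} \mathbf{d})
   = (\mathbf{c}^T \mathbf{V} \mathbf{c})\,
     (\mathbf{A}_{\Delta} - \mathbf{E})^T \mathbf{V}^{-} (\mathbf{A}_{\Delta} - \mathbf{E}),
\end{align*}
% with equality when b is proportional to V_{22}^{-1} d,
% that is, when c is proportional to V^{-}(A_\Delta - E).
```

Dividing both sides by $\mathbf{c}^T \mathbf{V} \mathbf{c}$ gives the bound, and the equality case identifies the data-dependent weights mentioned above.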

Both $\hat{\Delta}_{hl}$ and $\hat{\Delta}_{gmm}$ use the same information, the information in Table 1 for varied $\Delta_0$ and the moment equations (2); however, $\hat{\Delta}_{hl}$ uses an a priori weighting of the equations yielding the one estimating equation (4), while $\hat{\Delta}_{gmm}$ weights the equations (2) using $\mathbf{V}^{-}$. In the sense of (7), $\hat{\Delta}_{gmm}$ uses the "best" weights as judged by the sample, and asymptotically it achieves the efficiency associated with knowing what are the best fixed weights to use; see the Appendix. Is it best to use the "best" weights?

The generalized method of moments includes many 
familiar methods of estimation, including maximum 
likelihood, least squares and two-stage least squares 
with instrumental variables. In particular cases, such 
as weak instruments, it is known that poor estimates 
may result from GMM (e.g., Imbens, 1997; Staiger 
and Stock, 1997), but this is sometimes viewed as a 
weakness in the available data. A weak instrument 
may reduce efficiency but need not result in invalid- 
ity if appropriate methods of analysis are used (Im- 
bens and Rosenbaum, 2005). The two-sample shift 
problem is identified and presents no weakness in 
the data. 

2.4 Example 

The methods of Sections 2.2 and 2.3 will now be applied to the data in Figure 1, where the two boxplots have medians 8.8 and 15.6 differing by 6.8, and means 8.7 and 17.4 differing by 8.7. If the distributions were shifted by $\Delta$, then $T_{\Delta}$ would be distribution free using (1), and the expectation of $T_{\Delta}$ with rank weights, $w_j = j - 1$, $j = 1, 2, 3, 4$, would be 22.957 and the variance would be 6.1206. Now, $T_{8.7} = 22$, $T_{8.69999} = 23$, so $\hat{\Delta}_{hl} = 8.7$, which is by coincidence the same as the difference in means. Also, $G_{\Delta}^2$ takes its smallest value, 3.176, on the half open interval $[0.10, 4.90)$, so $\hat{\Delta}_{gmm} = (0.10 + 4.90)/2 = 2.5$. In these data, which estimate, $\hat{\Delta}_{hl} = 8.7$ or $\hat{\Delta}_{gmm} = 2.5$, looks better?

Fig. 2. Shifted boxplots using HL and GMM: boxplots of $Y_j - \hat{\Delta}_{hl}$, $X_i$ and $Y_j - \hat{\Delta}_{gmm}$. GMM failed to align the boxes: all of the quartiles of $Y_j - \hat{\Delta}_{gmm}$ are above those of $X_i$.

Figure 2 compares $\hat{\Delta}_{hl}$ and $\hat{\Delta}_{gmm}$ by displaying the control responses, $X_i$, together with the adjusted exposed responses, $Y_j - \hat{\Delta}_{hl}$ or $Y_j - \hat{\Delta}_{gmm}$. If the shift model were true, the estimates equaled the true shift, and the sample size were very large, then the three boxplots would look essentially the same. As it is, $Y_j - \hat{\Delta}_{gmm}$ appears both too high and too dispersed compared to the controls: all three quartiles of the $Y_j - \hat{\Delta}_{gmm}$ are above the corresponding quartiles of the $X_i$; the median of $Y_j - \hat{\Delta}_{gmm}$ is above the upper quartile of the $X_i$; the upper quartile of $Y_j - \hat{\Delta}_{gmm}$ is above the maximum of $X_i$. In contrast, the $Y_j - \hat{\Delta}_{hl}$ are shifted reasonably but appear more dispersed than the $X_i$: the upper quartile of the $Y_j - \hat{\Delta}_{hl}$ is too high while the lower quartile is too low; indeed, the entire boxplot of the $X_i$ fits inside the quartile box of the $Y_j - \hat{\Delta}_{hl}$. Obviously, a shift can relocate a boxplot but cannot alter its dispersion.



Assuming there is a shift $\Delta$, the exact distribution of the squared deviate $D_{\Delta}^2$ based on $T_{\Delta}$ is determined using (1) and it has $\Pr(D_{\Delta}^2 \ge 4.156) = 0.0436$. The $1 - 0.0436 = 95.6\%$ confidence set is the closure of the set $\{\Delta_0 : D_{\Delta_0}^2 < 4.156\}$, which is $[0.1, 19.5]$. The test of fit of the shift model based on the generalized method of moments yields $G_{\hat{\Delta}_{gmm}}^2 = G_{2.5}^2 = 3.176$, which is compared to $\chi^2$ on two degrees of freedom, yielding a significance level greater than 0.2. In short, the GMM test of the shift model based on $G_{\hat{\Delta}_{gmm}}^2$ suggests the shift model is plausible, and the appearance of Figure 2 could be due to chance. Figure 3 is the plot, suggested in Section 2.2, of the exact, two-sided $P$-values from $D_{\Delta_0}^2$ and $G_{\Delta_0}^2$ plotted against $\Delta_0$. To focus attention on small $P$-values, the vertical axis uses a logarithmic scale. To anchor that scale, horizontal lines are drawn at $P = 0.05$, 0.1 and 1/3. The solid step function for $D_{\Delta_0}^2$ cuts the horizontal $P = 0.05$ line at the endpoints for the 95% confidence interval. As suggested by Mosteller and Tukey (1977), a 2/3 confidence interval is analogous to an estimate plus or minus a standard error. The Hodges-Lehmann estimate is $\hat{\Delta}_{hl} = 8.7$. Notice, however, that at $\Delta_0 = 8.69$, the $P$-value for $D_{\Delta_0}^2$ is 1.000, but the $P$-value for $G_{\Delta_0}^2$ is 0.021; that is, the shift model is implausible for a value of $\Delta_0$ judged highly plausible by $D_{\Delta_0}^2$. Table 2 is Table 1 evaluated at $\Delta_0 = 8.69$, and from this table, it is easy to see what has happened. With $\Delta_0 = 8.69$ subtracted from treated responses, $T_{8.69} = 5 \times 3 + 3 \times 2 + 2 \times 1 + 6 \times 0 = 23$, which is as close as possible to the null expectation 22.957 of $T_{\Delta}$, but because of the greater dispersion in the treated group, all $6 + 5 = 11$ observations outside the pooled upper and lower quartiles are treated responses, leading to a large $G_{8.69}^2 = 9.2$ with exact significance level 0.021. The pattern in Table 2 is hardly a surprise: the comparison of dispersions is most decisive when it is not obscured by unequal locations.

Because $T_{\Delta_0}$ is monotone in $\Delta_0$, the 95%, 90% and 2/3 confidence sets it yields in Figure 3 are intervals. In contrast, the 95%, 90% and 2/3 confidence sets based on $G_{\Delta_0}^2$ are not intervals; for instance, the 90% confidence set is the union of three disjoint intervals. If the confidence interval is defined as the shortest closed interval containing the confidence set, then the three intervals based on $G_{\Delta_0}^2$ are all longer than the corresponding intervals based on $D_{\Delta_0}^2$. Figure 4 calculates the large sample confidence interval from GMM, plotting $G_{\Delta_0}^2 - G_{\hat{\Delta}_{gmm}}^2$ against $\Delta_0$. For instance, the dotted line labeled 95% in Figure 4 is at 3.841, the 95% point of the chi-square distribution with one degree of freedom. The 95% confidence set for $\Delta$ is the set of $\Delta_0$ such that $G_{\Delta_0}^2 - G_{\hat{\Delta}_{gmm}}^2 < 3.841$, and it is the union of two disjoint intervals; similarly, the 90% confidence set is the union of three disjoint intervals, and the 2/3 confidence set is the union of two disjoint intervals. The shortest closed interval containing the 95% confidence set for $\Delta$ is $[-2.7, 19.5]$, which is longer than the exact 95% confidence interval based on $D_{\Delta_0}^2$, namely $[0.1, 19.5]$. Of course, both confidence intervals include values of $\Delta$ rejected by $G_{\Delta_0}^2$, so the shift model is not really plausible for some parameter values that both intervals report as plausible shifts.

Table 2
Table for testing $H_0: \Delta = 8.69$

Quartile interval   Treated   Control   Total
Lowest                 6         0        6
Low                    2         4        6
High                   3         3        6
Highest                5         0        5
Total                 16         7       23

In short, the method of Section 2.2 gave a point estimate of $\hat{\Delta}_{hl} = 8.7$ consistent with the difference in means, but raised doubts about whether a shift model is appropriate, rejecting the shift model for $H_0: \Delta = 8.69$. In contrast, the method of Section 2.3 suggested the shift is much smaller, $\hat{\Delta}_{gmm} = 2.5$, and the associated goodness of fit test based on $G_{\hat{\Delta}_{gmm}}^2$ suggested that the shift model is plausible. Obviously, the example just illustrates what the two methods do with one data set; it tells us nothing about performance in large or small samples.

3. SIMULATION 

3.1 Structure of the Simulation 

The simulation considered three distributions $F(\cdot)$,
namely the Normal (N), the Cauchy (C) and the 
convolution of a standard Normal with a standard 
Exponential (NE). Recall that the standard Normal 
and Exponential distributions each have variance 
one, so NE has variance two. Although the support 
of NE is the entire line, NE has a long right tail and 
a short left tail, and is moderately asymmetric near 
its median. There were 5,000 samples drawn for each 
sampling situation. 
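The three sampling distributions are easy to generate with the Python standard library. The sketch below is an illustration of the sampling step only, not the authors' simulation code: the function names `draw` and `two_samples`, the seed, and the sample sizes in the usage lines are all assumptions.

```python
import math
import random
import statistics


def draw(dist, size, rng):
    """Draw `size` values from N (Normal), C (Cauchy) or NE (Normal + Exponential)."""
    if dist == "N":
        return [rng.gauss(0.0, 1.0) for _ in range(size)]
    if dist == "C":
        # Inverse-CDF method for the standard Cauchy distribution.
        return [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(size)]
    if dist == "NE":
        # Sum of independent standard Normal and standard Exponential variables.
        return [rng.gauss(0.0, 1.0) + rng.expovariate(1.0) for _ in range(size)]
    raise ValueError(f"unknown distribution: {dist}")


def two_samples(dist, n, m, delta, rng):
    """Controls X (size m) from F and exposed Y (size n) shifted by delta."""
    x = draw(dist, m, rng)
    y = [v + delta for v in draw(dist, n, rng)]
    return x, y


rng = random.Random(0)
ne = draw("NE", 20000, rng)
print(round(statistics.variance(ne), 2))  # should be near 2
x, y = two_samples("C", 24, 24, 1.0, rng)
print(len(x), len(y))
```

Repeating `two_samples` 5,000 times per setting and applying each estimator to every draw would give the mean squared errors compared in the tables of Section 3.2.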

We considered several estimators, including the two in Section 2.4, namely $\hat{\Delta}_{gmm}$ based on GMM and $\hat{\Delta}_{hl}$ using scores $w_1 = 0$, $w_2 = 1$, $w_3 = 2$, $w_4 = 3$. Note that the weights for $\hat{\Delta}_{hl}$ are close to the optimal weights for the Normal. The estimate $\hat{\Delta}_{m}$ with scores $w_1 = 0$, $w_2 = 0$, $w_3 = 1$, $w_4 = 1$ is the Hodges-Lehmann point estimate associated with Mood's two sample median test; these scores are close to the optimal scores for the Cauchy. The estimate $\hat{\Delta}_{mert}$ with weights $w_1 = 0$, $w_2 = 0.18$, $w_3 = 0.82$, $w_4 = 1$ is Gastwirth's compromise weights for the Normal and the Cauchy; these scores are almost the same as $w_1 = 0$, $w_2 = 1$, $w_3 = 4$, $w_4 = 5$, so are much closer to $\hat{\Delta}_{m}$ than to $\hat{\Delta}_{hl}$. The coverage rates and behavior of confidence intervals and the null distribution of the goodness of fit test based on $G_{\hat{\Delta}_{gmm}}^2$ were also examined.

3.2 Efficiency 

Asymptotic theory says: (i) $\hat{\Delta}_{gmm}$ should always win in sufficiently large samples, (ii) $\hat{\Delta}_{hl}$ should be close to the best for the Normal, (iii) $\hat{\Delta}_{m}$ should be close to the best for the Cauchy and (iv) $\hat{\Delta}_{mert}$ should be better than $\hat{\Delta}_{hl}$ for the Cauchy and better than $\hat{\Delta}_{m}$ for the Normal.

Fig. 3. Plot of exact significance levels for testing $H_0: \Delta = \Delta_0$ using $D_{\Delta_0}^2$ and $G_{\Delta_0}^2$ versus $\Delta_0$. Note that $G_{\Delta_0}^2$ rejects the $\Delta_0$ that minimizes $D_{\Delta_0}^2$.

Table 3 compares efficiency for samples from the Normal distribution. The values in the table are ratios of mean squared errors averaged over 5000 samples, so the value 0.72 for $n = m = 24$ in $\hat{\Delta}_{gmm} : \hat{\Delta}_{hl}$ indicates that $\hat{\Delta}_{hl}$ had a mean squared error that was 72% of the mean squared error of $\hat{\Delta}_{gmm}$. The predictions of large sample theory are qualitatively correct, but some of the quantitative results are striking. The performance of $\hat{\Delta}_{gmm}$ improves with increasing sample size, but $\hat{\Delta}_{gmm}$ is still 7% behind $\hat{\Delta}_{hl}$ with $n = m = 10{,}000$ and only marginally better than $\hat{\Delta}_{mert}$. For sample sizes of $n = m = 2{,}000$ or less, $\hat{\Delta}_{mert}$ is better than $\hat{\Delta}_{gmm}$, often substantially so. In Table 3, the relative performance of $\hat{\Delta}_{gmm}$ is still improving as the sample size increases, with the promised optimal performance not yet visible for the sample sizes in the table. With $n = m = 40{,}000$, not shown in the table, the relative efficiency of $\hat{\Delta}_{gmm}$ is still about 5% behind $\hat{\Delta}_{hl}$.

Table 3
Efficiency for samples from the Normal distribution

n                                             24     50     20     80     500    2,000   10,000
m                                             24     50     80     80     500    2,000   10,000
$\hat{\Delta}_{gmm} : \hat{\Delta}_{hl}$      0.72   0.76   0.81   0.79   0.86   0.89    0.93
$\hat{\Delta}_{gmm} : \hat{\Delta}_{m}$       1.00   1.05   1.08   1.07   1.14   1.18    1.24
$\hat{\Delta}_{gmm} : \hat{\Delta}_{mert}$    0.80   0.86   0.91   0.89   0.94   0.97    1.03

Summary: $\hat{\Delta}_{hl}$ is best in all cases, and $\hat{\Delta}_{mert}$ is second for $n, m \le 2000$.

Table 4 is the analogous table for samples from the Cauchy distribution. As before, the relative performance of $\hat{\Delta}_{gmm}$ improves with increasing sample size, so that it is inferior to $\hat{\Delta}_{hl}$ for $n = m = 80$ but superior for $n = m = 500$. In Table 4, $\hat{\Delta}_{m}$ is best everywhere, as anticipated, but $\hat{\Delta}_{mert}$ is close behind, marginally ahead of $\hat{\Delta}_{gmm}$ even for $n = m = 10{,}000$, and well ahead for smaller sample sizes. Efficiency comparisons for the convolution of a Normal and an Exponential random variable are given in Table 5. The best estimator in all cases in Table 5 is $\hat{\Delta}_{hl}$. The relative performance of $\hat{\Delta}_{gmm}$ improves with increasing sample size, but it is still 5% behind $\hat{\Delta}_{hl}$ for $n = m = 10{,}000$. Also, $\hat{\Delta}_{mert}$ is ahead of $\hat{\Delta}_{gmm}$ up to $n = m = 2{,}000$.

Table 4
Efficiency for samples from the Cauchy distribution

n                                             24     50     20     80     500    2,000   10,000
m                                             24     50     80     80     500    2,000   10,000
$\hat{\Delta}_{gmm} : \hat{\Delta}_{hl}$      0.82   0.92   0.90   0.95   1.07   1.11    1.18
$\hat{\Delta}_{gmm} : \hat{\Delta}_{m}$       0.66   0.75   0.74   0.76   0.86   0.90    0.94
$\hat{\Delta}_{gmm} : \hat{\Delta}_{mert}$    0.70   0.78   0.78   0.80   0.89   0.93    0.99

Summary: $\hat{\Delta}_{m}$ is best in all cases, and $\hat{\Delta}_{mert}$ is second in all cases.

In summary, the relative efficiency of the GMM estimator $\hat{\Delta}_{gmm}$ does increase with increasing sample size, as the asymptotic theory says it should, but the improvement is remarkably slow. The fixed score estimator, $\hat{\Delta}_{mert}$, is designed to achieve reasonable performance for both the Normal and the Cauchy, and it is more efficient than $\hat{\Delta}_{gmm}$ for all three sampling distributions in Tables 3 to 5 for sample sizes up to $n = m = 2{,}000$.

3.3 Confidence Intervals 

Table 5
Efficiency for samples from the Normal + Exponential distribution

n                                             24     50     20     80     500    2,000   10,000
m                                             24     50     80     80     500    2,000   10,000
$\hat{\Delta}_{gmm} : \hat{\Delta}_{hl}$      0.85   0.78   0.76   0.83   0.87   0.91    0.95
$\hat{\Delta}_{gmm} : \hat{\Delta}_{m}$       1.11   1.03   0.98   1.08   1.17   1.19    1.25
$\hat{\Delta}_{gmm} : \hat{\Delta}_{mert}$    0.94   0.85   0.83   0.91   0.95   0.99    1.04

Summary: $\hat{\Delta}_{hl}$ is best in all cases, and $\hat{\Delta}_{mert}$ is second for $m, n \le 2000$.

In each of the 3 × 7 = 21 sampling situations in Tables 3-5, we computed the large sample nominally 95% confidence intervals from $\hat{\Delta}_{hl}$, $\hat{\Delta}_{mert}$ and $\hat{\Delta}_{gmm}$, and empirically determined the actual coverage rate. For comparison, a binomial proportion with 5000 trials and probability of success 0.95 has standard error 0.003, so 0.95 ± (2 × 0.003) is 0.944 to 0.956.
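The binomial yardstick quoted above is a one-line calculation; the short sketch below reproduces it, with variable names chosen here for illustration.

```python
import math

p, trials = 0.95, 5000
se = math.sqrt(p * (1 - p) / trials)   # binomial standard error of a proportion
low, high = p - 2 * se, p + 2 * se     # two-standard-error band around 0.95
print(round(se, 4), round(low, 3), round(high, 3))
```

Simulated coverage rates outside this band differ from 95% by more than simulation noise alone would readily explain.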

The group rank confidence intervals performed well, with coverage close to the nominal level, even when large sample approximations were applied to small samples. All of the 2 × 21 = 42 simulated coverage rates for $\hat{\Delta}_{hl}$ and $\hat{\Delta}_{mert}$ were between 93.7% and 96.1% and only one was less than 94%. Exact intervals are available for $\hat{\Delta}_{hl}$ and $\hat{\Delta}_{mert}$ based on the hypergeometric distribution, but there is no need to simulate these, because their coverage rates are exactly as stated. The empirical coverages of the 95% confidence interval based on GMM are displayed in Table 6. As in Section 3.2, the asymptotic theory appears correct in the limit but takes hold very slowly. The coverage of the nominal 95% interval is about 90% for $n = m = 80$ and about 92% for $n = m = 500$. Also, by the results in Section 3.2, these intervals from $\hat{\Delta}_{gmm}$ not only have lower coverage than the intervals for the group rank test, but also for sample sizes up to $n = m = 2{,}000$ the intervals from $\hat{\Delta}_{gmm}$ are typically longer intervals than those from $\hat{\Delta}_{mert}$ as well. This is not much of a trade-off: lower coverage combined with longer intervals.

Finally, all of the confidence sets from $\hat{\Delta}_{hl}$ and $\hat{\Delta}_{mert}$ are intervals, but for $\hat{\Delta}_{gmm}$ the confidence sets from GMM are often not intervals, and become intervals only by including interior segments that the test rejected. For $\hat{\Delta}_{gmm}$ for the Normal, with $n = m = 24$, only 49% of the confidence sets are intervals, rising to 67% for $n = m = 500$ and 83% for $n = m = 10{,}000$. Results for the other two distributions are not very different.

3.4 GMM's Goodness of Fit Test 

Recall that $G^2_{\hat{\Delta}_{gmm}}$ is often used as a test of the
model, in the current context, comparing it to the
chi-square distribution on two degrees of freedom.
Here, too, the asymptotic properties appear true
but are approached very slowly. In particular, when
the shift model is correct, $G^2_{\hat{\Delta}_{gmm}}$ tends to be too
small, compared to chi-square on two degrees of
freedom, both in the tail and on average. For instance,
the chi-square P-value from $G^2_{\hat{\Delta}_{gmm}}$ is less
than 0.05 in 0.1% of samples of size n = m = 24 from
the Normal, in 0.6% of samples of size n = m = 80,
in 2% of samples of size n = m = 500, and in 3.7%
of samples of size n = m = 10,000. Similarly, instead
of expectation 2 for a chi-square with two
degrees of freedom, the expectation of $G^2_{\hat{\Delta}_{gmm}}$ was
0.9 for samples of size n = m = 24 from the Normal,
1.1 for samples of size n = m = 80, 1.4 for
samples of size n = m = 500, and 1.8 for samples
of size n = m = 10,000. Similar results were found
for the Cauchy and Normal + Exponential. In sharp
contrast, one compares $G^2_{\Delta}$ to chi-square with three
degrees of freedom, and $E(G^2_{\Delta}) = 3$ exactly in samples
of every size from every distribution; see Section
2.2. In other words, replacing the true $\Delta$ by the estimate
$\hat{\Delta}_{gmm}$ and reducing the degrees of freedom
by one to compensate is an adequate correction only
in very large samples. To understand the behavior
of $G^2_{\hat{\Delta}_{gmm}}$, it helps to recall what happened in the
example in Section 2.4. There, $G^2_{\Delta}$ was minimized
at a peculiar choice $\hat{\Delta}_{gmm}$ of $\Delta$, in part because
$G^2_{\Delta}$ avoided not only implausible shifts but also tables
like Table 2 which suggest unequal dispersion.
Having avoided Table 2, that is, having avoided evidence
of unequal dispersion by its choice of $\hat{\Delta}_{gmm}$,
the goodness of fit test, $G^2_{\hat{\Delta}_{gmm}}$, found no evidence
of unequal dispersion. This suggests it may be best
to separate two tasks, namely estimation assuming
a model is true, and testing the goodness of fit of
the model.
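
For reference, the benchmark against which these rejection rates and expectations are judged can be sketched in a few lines. The sketch below draws from an exact chi-square distribution with two degrees of freedom; it does not compute the GMM statistic itself. The function names, the number of draws, and the seed are illustrative assumptions.

```python
import math
import random


def chi2_sf_2df(x: float) -> float:
    """Upper-tail probability of a chi-square with 2 df, which is exp(-x/2)."""
    return math.exp(-x / 2.0)


def benchmark(num_draws: int = 20000, seed: int = 0):
    """Mean and 0.05-level rejection rate for exact chi-square(2) draws."""
    rng = random.Random(seed)
    total = 0.0
    rejections = 0
    for _ in range(num_draws):
        # The sum of two squared standard normals is chi-square with 2 df.
        g2 = rng.gauss(0.0, 1.0) ** 2 + rng.gauss(0.0, 1.0) ** 2
        total += g2
        if chi2_sf_2df(g2) < 0.05:
            rejections += 1
    return total / num_draws, rejections / num_draws


mean_g2, reject_rate = benchmark()
print(f"mean = {mean_g2:.2f}, rejection rate = {reject_rate:.3f}")
```

A statistic that truly followed this reference distribution would average about 2 and reject in about 5% of samples, which is the yardstick for the values 0.9 to 1.8 and 0.1% to 3.7% reported above.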

4. DISPLACEMENT EFFECTS 

In the example in Section 1.2, the HL estimate 
gave a more reasonable estimate of shift than did the 
GMM estimate assuming the shift model to be true, 
and based on that estimate, raised clearer doubts 
about whether the shift model was appropriate. This 
is seen in Figures 2 and 3. Having raised doubts 
about the shift model, it is natural to seek exact in- 
ferences for the magnitude of the effect without as- 
suming a shift. The shift model is not needed for an 
exact inference comparing two distributions. There 
are 112 = 16 x 7 possible comparisons of the n = 16 
exposed subjects to the m = 7 controls, and in V = 
87 of these comparisons the exposed subject had 
a higher response, where V is the Mann-Whitney 
statistic. Under the null hypothesis of no treatment
effect in a randomized experiment, Pr(V ≥ 82) = 0.044.
It follows from the argument in
Rosenbaum (2001, Section 4) that in a randomized
experiment, we would be 1 - 0.044 = 0.956, or 95.6%, confident
that at least 87 - 82 + 1 = 6 of the 112 possible

Table 6
Empirical coverage (%) of nominal 95% intervals from GMM

n                        24    50    20    80   500  2,000  10,000
m                        24    50    80    80   500  2,000  10,000

Normal                 85.7  89.7  88.0  90.1  91.9   92.3    94.2
Cauchy                 87.2  90.9  89.3  91.3  92.7   93.6    94.1
Normal + Exponential   87.2  90.5  87.6  89.8  92.1   93.2    94.3

Summary: 95% intervals from GMM miss too often for n, m < 2,000.



comparisons, or about 5% of them, favor the exposed
group because of effects of the treatment, and
the remaining 87 - 6 = 81 favorable comparisons could
be due to chance. So the effect is not plausibly zero
but could be quite small. The methods in Rosenbaum
(2001, Section 5) may be used to display the
sensitivity of this inference to departures from random
assignment of treatments in an observational
study of the sort described in Section 1.2.
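
The tail probability used in this argument can be checked by enumerating the exact null distribution of the Mann-Whitney statistic. The following is a minimal sketch assuming no ties; the function names are illustrative, and the recursion is the standard one that conditions on whether the largest pooled observation is an exposed subject or a control.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_arrangements(n_exposed: int, n_control: int, v: int) -> int:
    """Number of orderings (no ties) with exactly v pairs exposed > control."""
    if v < 0 or v > n_exposed * n_control:
        return 0
    if n_exposed == 0 or n_control == 0:
        return 1  # v is necessarily 0 here because of the bound check above
    # Condition on the largest pooled observation:
    #   exposed -> it exceeds every control, contributing n_control pairs;
    #   control -> it contributes no pairs.
    return (count_arrangements(n_exposed - 1, n_control, v - n_control)
            + count_arrangements(n_exposed, n_control - 1, v))


def upper_tail(n_exposed: int, n_control: int, k: int) -> float:
    """Pr(V >= k) when all orderings of the pooled sample are equally likely."""
    total = comb(n_exposed + n_control, n_control)
    favorable = sum(count_arrangements(n_exposed, n_control, v)
                    for v in range(k, n_exposed * n_control + 1))
    return favorable / total


n_exposed, n_control = 16, 7      # sample sizes from the example
observed_v, critical_k = 87, 82   # values quoted in the text

p_tail = upper_tail(n_exposed, n_control, critical_k)
attributable = observed_v - critical_k + 1
print(f"Pr(V >= {critical_k}) = {p_tail:.3f}; "
      f"at least {attributable} comparisons attributed to treatment")
```

The printed tail probability can be compared with the 0.044 quoted above, and the last line reproduces the arithmetic 87 - 82 + 1 = 6.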

5. SUMMARY: WHAT ROLE FOR 
ASSUMPTIONS? 

In the radiation effects example in Section 1.2, the 
generalized method of moments (GMM) estimator, 
$\hat{\Delta}_{gmm}$, estimated a small shift, one that did little
to align the boxplots in Figure 2, and the associated
test of the shift model using $G^2_{\hat{\Delta}_{gmm}}$ suggested
that the shift model was plausible. In contrast, the
Hodges-Lehmann estimate, $\hat{\Delta}_{hl}$, estimated a larger
shift, one that did align the centers of the boxplots 
in Figure 2, but with this shift, the shift model 
seemed implausible, leading to the analysis in Sec- 
tion 4 which dispensed with the shift model. Al- 
though one should not make too much of a single, 
small example, our sense in this one instance was 
that GMM's results gave an incorrect impression of 
what the data had to say. 

The simulation considered situations in which the 
shift model was true. The promise of full asymptotic 
efficiency with GMM did seem to be true, but was 
very slow in coming, requiring astonishingly large 
sample sizes. For moderate sample sizes, m < 500, 
n < 500, GMM was neither efficient nor valid: it pro- 
duced longer 95% confidence intervals often with 
coverage well below 95%. In contrast, asymptotic 
results were a good guide to the performance of 
the group rank statistics for all sample sizes consid- 
ered, even for samples of size n = m = 24. Moreover, 
exact inference is straightforward with group rank
statistics. The estimator $\hat{\Delta}_{mert}$ aims to avoid bad
performance under a range of assumptions rather
than to achieve optimal performance under one set
of assumptions. By every measure, in every situation,
$\hat{\Delta}_{mert}$ was better than $\hat{\Delta}_{gmm}$ for m < 2,000, n < 2,000.

Our goal has not been to provide a
comprehensive analysis of the radiation effects ex- 
ample, nor to provide improved methods for esti- 
mating a shift. Rather, our goal was to create a labo- 
ratory environment — transparent, quiet, simple, 
undisturbed — in which two strategies for creating 
estimators might be compared. The laboratory con- 
ditions were favorable for GMM: (i) the shift param- 
eter is strongly identified, (ii) there are only three 
moment conditions and (iii) the optimal weight ma- 
trix for the moment conditions is known exactly and 
is free of unknown parameters. In a theorem, the 
assumptions are the premises of an argument, and 
for the sole purpose of proving the theorem, they 
play similar roles: the same conclusion with fewer 
assumptions is a "better" theorem, or at least bet- 
ter in certain important senses. When used in sci- 
entific applications, these same assumptions acquire 
different roles. As in the quote from Tukey in Sec- 
tion 1.1, assumptions needed for validity of confi- 
dence intervals and hypothesis tests play a different 
role from assumptions used for efficiency or strin- 
gency, and both play a very different role from the 
hypothesis itself. A familiar instance of this arises 
with hypotheses: omnibus hypotheses (ones that as- 
sume very little) are not automatically better hy- 
potheses than focused hypotheses (ones that assume 
much more) — power may be much higher for the 
focused hypotheses, and which is relevant depends 
on the science of the problem at hand. The trade- 
off discussed by Tukey is a less familiar instance. 
Here, we have examined a small, clean theoretical 
case study, in which the same information is used by 
different methods that embody different attitudes 
toward assumptions. The group rank statistics followed
Tukey's advice, in which validity was obtained
by general permutation test arguments, but the
weights used in the tests reflected judgements in- 
formed by statistical theory in an effort to obtain 
decent efficiency for a variety of sampling distribu- 
tions. The generalized method of moments (GMM) 
tried to estimate the weights, and thereby always 
have the most efficient procedure, at least asymptot- 
ically. In point of fact, the gains in efficiency with 
GMM did not materialize until very large sample 
sizes were reached, whereas validity of confidence 
intervals was severely compromised in samples of 
conventional size. 

APPENDIX: LARGE SAMPLE THEORY 
UNDER LOCAL ALTERNATIVES 

This appendix discusses the asymptotic efficiency 
of GMM against local alternatives. Hansen's results 
about GMM concerned statistics that are differen- 
tiable in ways that rank statistics are not, but his 
conclusions hold nonetheless if differentiability is re- 
placed by asymptotic linearity using Theorem 1 from 
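Jureckova (1969); see Theorem A.2 below. Here we
consider the limiting behavior of $D_{N,\Delta_N}$ and $G^2_{N,\Delta_N}$
when $H_0\colon \Delta = \Delta_N$ is false but nearly correct, that
is, when $F(x) = G(x + \Delta)$ but $\Delta_N = \Delta - \delta/\sqrt{N}$, as
$N = n + m \to \infty$ with $\lambda_N = n/(n + m) \to \lambda$, $0 < \lambda < 1$.
Because $N \to \infty$, quantities from earlier sections
computed from the sample of size $N$ now have
an $N$ subscript, for example, $E(\mathbf{A}_{N,\Delta}) = \mathbf{E}_N$ and
$\mathrm{var}(\mathbf{A}_{N,\Delta}) = \mathbf{V}_N$. Then $Z^{\Delta}_{N1}, \ldots, Z^{\Delta}_{NN}$ are i.i.d. $F(\cdot)$,
but $D_{N,\Delta_N}$ and $G^2_{N,\Delta_N}$ are computed from
$Z^{\Delta-\delta/\sqrt{N}}_{N1}, \ldots, Z^{\Delta-\delta/\sqrt{N}}_{NN}$, so
$Z^{\Delta-\delta/\sqrt{N}}_{N1}, \ldots, Z^{\Delta-\delta/\sqrt{N}}_{Nm}$ corresponding
to the X's are i.i.d. $F(\cdot)$, but
$Z^{\Delta-\delta/\sqrt{N}}_{N,m+1}, \ldots, Z^{\Delta-\delta/\sqrt{N}}_{NN}$
corresponding to the Y's have the same
distribution but shifted upwards by $\delta/\sqrt{N}$. Assume
$F$ has density $f$ with finite Fisher information and
write

\[
\varphi(u, f) = -\frac{f'\{F^{-1}(u)\}}{f\{F^{-1}(u)\}}
\]

and

\[
\eta_g = \lambda(1-\lambda) \int_{(g-1)/4}^{g/4} \varphi(u, f)\, du \quad \text{for } g = 1, 2, 3, 4.
\]

Let $\boldsymbol{\eta} = (\eta_1, \eta_2, \eta_3, \eta_4)^T$ and notice that $0 = \sum_g \eta_g$ by
Hajek, Sidak, and Sen (1999, Lemma 1, page 18).
Then a result of Jureckova (1969, Theorem 3.1, page
1891) yields: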



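Theorem A.1 (Jureckova, 1969). For any fixed
$\mathbf{w} = (0, w_2, w_3, w_4)^T$ with $0 = w_1 < w_2 < w_3 < w_4$,
$\epsilon > 0$ and $\Upsilon > 0$,

\[
\lim_{N \to \infty} \Pr\left\{ \max_{|\delta| < \Upsilon}
\left| \frac{1}{\sqrt{N}}\, \mathbf{w}^T \left( \mathbf{A}_{N,\Delta-\delta/\sqrt{N}} - \mathbf{A}_{N,\Delta} \right)
- \delta\, \mathbf{w}^T \boldsymbol{\eta} \right| > \epsilon \right\} = 0.
\]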



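As $N \to \infty$, one has $(1/N)\mathbf{V}_N \to \boldsymbol{\Sigma}$ and $N\mathbf{V}_N^{-} \to \boldsymbol{\Sigma}^{-}$,
where $\boldsymbol{\Sigma}$ has entries $\sigma_{ii} = 3\lambda(1-\lambda)/16$ and
$\sigma_{ij} = -\lambda(1-\lambda)/16$ for $i \neq j$, and $\boldsymbol{\Sigma}^{-}$, like $\mathbf{V}_N^{-}$, has
first row and column equal to zero. Moreover, Theorem
A.1 implies

\[
\frac{\mathbf{w}^T \left( \mathbf{A}_{N,\Delta-\delta/\sqrt{N}} - \mathbf{E}_N \right)}{\sqrt{\mathbf{w}^T \mathbf{V}_N \mathbf{w}}}
\xrightarrow{D} N\left( \frac{\delta\, \mathbf{w}^T \boldsymbol{\eta}}{\sqrt{\mathbf{w}^T \boldsymbol{\Sigma} \mathbf{w}}},\; 1 \right).
\tag{9}
\]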

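The noncentrality parameter in (9), $\mathbf{w}^T \boldsymbol{\eta} / \sqrt{\mathbf{w}^T \boldsymbol{\Sigma} \mathbf{w}}$,
is maximized with $\mathbf{w} = \boldsymbol{\Sigma}^{-} \boldsymbol{\eta}$, so the best group rank
statistic has $\mathbf{w} = \boldsymbol{\Sigma}^{-} \boldsymbol{\eta}$. The GMM confidence interval
for $\Delta$ was calculated by comparing
$G^2_{N,\Delta} - G^2_{N,\hat{\Delta}_{gmm}}$ to the chi-square distribution with one
degree of freedom. Theorem A.2 shows
$G^2_{N,\Delta} - G^2_{N,\hat{\Delta}_{gmm}}$ converges in probability (henceforth
$\xrightarrow{P}$) to the best group rank statistic.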
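
As a worked check of the optimal weights, the stated entries of $\boldsymbol{\Sigma}$ give an explicit form for $\boldsymbol{\Sigma}^{-}\boldsymbol{\eta}$. The derivation below is a sketch under two assumptions: that $\boldsymbol{\Sigma}^{-}$ is the generalized inverse obtained by inverting the lower-right 3 x 3 block of $\boldsymbol{\Sigma}$ and padding with a zero first row and column, and that the shorthand $c = \lambda(1-\lambda)$, $\mathbf{I}_3$ (identity) and $\mathbf{J}_3$ (all-ones matrix) is introduced here only for illustration.

```latex
% Lower-right 3x3 block of Sigma, with c = lambda(1 - lambda):
\[
\boldsymbol{\Sigma}_{(2:4)} = \frac{c}{16}\,\bigl(4\mathbf{I}_3 - \mathbf{J}_3\bigr),
\qquad
\bigl(4\mathbf{I}_3 - \mathbf{J}_3\bigr)\bigl(\mathbf{I}_3 + \mathbf{J}_3\bigr) = 4\mathbf{I}_3
\;\Longrightarrow\;
\boldsymbol{\Sigma}_{(2:4)}^{-1} = \frac{4}{c}\,\bigl(\mathbf{I}_3 + \mathbf{J}_3\bigr).
\]
% Apply to eta and use eta_1 + eta_2 + eta_3 + eta_4 = 0:
\[
w_1 = 0, \qquad
w_g = \frac{4}{c}\Bigl(\eta_g + \sum_{j=2}^{4} \eta_j\Bigr)
    = \frac{4}{c}\,\bigl(\eta_g - \eta_1\bigr), \quad g = 2, 3, 4.
\]
```

Under these assumptions the best weights are proportional to $\eta_g - \eta_1$, that is, to the scores $\eta_g$ shifted so that the first weight is zero.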



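Theorem A.2. If $F$ has finite Fisher information,
then

\[
\frac{\left\{ \left( \mathbf{A}_{N,\Delta} - \mathbf{E}_N \right)^T \mathbf{V}_N^{-} \boldsymbol{\eta} \right\}^2}{\boldsymbol{\eta}^T \mathbf{V}_N^{-} \boldsymbol{\eta}}
- \left( G^2_{N,\Delta} - G^2_{N,\hat{\Delta}_{gmm}} \right) \xrightarrow{P} 0.
\tag{10}
\]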



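Proof. Write $\|\mathbf{a}\|_N^2 = N \mathbf{a}^T \mathbf{V}_N^{-} \mathbf{a}$. Then
$G^2_{N,\Delta} = \| \tfrac{1}{\sqrt{N}} (\mathbf{A}_{N,\Delta} - \mathbf{E}_N) \|_N^2$ and
$G^2_{N,\Delta-\delta/\sqrt{N}} = \| \tfrac{1}{\sqrt{N}} (\mathbf{A}_{N,\Delta-\delta/\sqrt{N}} - \mathbf{E}_N) \|_N^2$.
Define $\tilde{G}^2_{N,\Delta-\delta/\sqrt{N}} = \| \tfrac{1}{\sqrt{N}} (\mathbf{A}_{N,\Delta} - \mathbf{E}_N) + \delta \boldsymbol{\eta} \|_N^2$,
which is quadratic in $\delta$, and is minimized at

\[
\delta_N = - \frac{\sqrt{N}\, \boldsymbol{\eta}^T \mathbf{V}_N^{-} \left( \mathbf{A}_{N,\Delta} - \mathbf{E}_N \right)}{N\, \boldsymbol{\eta}^T \mathbf{V}_N^{-} \boldsymbol{\eta}}
\xrightarrow{D} N\left( 0,\; \frac{1}{\boldsymbol{\eta}^T \boldsymbol{\Sigma}^{-} \boldsymbol{\eta}} \right);
\]

moreover, $\tilde{G}^2_{N,\Delta-\delta/\sqrt{N}}$ has minimum value
$G^2_{N,\Delta} - \{ (\mathbf{A}_{N,\Delta} - \mathbf{E}_N)^T \mathbf{V}_N^{-} \boldsymbol{\eta} \}^2 / \boldsymbol{\eta}^T \mathbf{V}_N^{-} \boldsymbol{\eta}$.
Let $\mathcal{E}_\Upsilon$ be the event
$\mathcal{E}_\Upsilon = \{ \sqrt{N}\, |\hat{\Delta}_{gmm} - \Delta| < \Upsilon \text{ and } |\delta_N| < \Upsilon \}$.
It is possible to pick a large $\Upsilon > 0$ such that for all sufficiently
large $N$, the probability $\Pr(\mathcal{E}_\Upsilon)$ is arbitrarily
close to 1. Therefore, in proving (10), we assume $\mathcal{E}_\Upsilon$ has
occurred. Write
$\boldsymbol{\psi}_{N,\delta} = \{ (\mathbf{A}_{N,\Delta-\delta/\sqrt{N}} - \mathbf{A}_{N,\Delta}) / \sqrt{N} \} - \delta \boldsymbol{\eta}$
and note that Theorem A.1 implies
$\max_{|\delta| < \Upsilon} \| \boldsymbol{\psi}_{N,\delta} \|_N^2 \xrightarrow{P} 0$.
By the triangle inequality, for any norm $\| \cdot \|$,

\[
\left| \|\mathbf{a}\|^2 - \|\mathbf{b}\|^2 \right|
\le \|\mathbf{a} - \mathbf{b}\|^2 + 2 \|\mathbf{a} - \mathbf{b}\| \max\left( \|\mathbf{a}\|, \|\mathbf{b}\| \right).
\tag{11}
\]

With $\mathbf{a} = (\mathbf{A}_{N,\Delta-\delta/\sqrt{N}} - \mathbf{E}_N)/\sqrt{N}$ and
$\mathbf{b} = (\mathbf{A}_{N,\Delta} - \mathbf{E}_N)/\sqrt{N} + \delta \boldsymbol{\eta}$, so
$\mathbf{a} - \mathbf{b} = \boldsymbol{\psi}_{N,\delta}$, then (11) yields

\[
\max_{|\delta| < \Upsilon} \left| G^2_{N,\Delta-\delta/\sqrt{N}} - \tilde{G}^2_{N,\Delta-\delta/\sqrt{N}} \right|
\le \max_{|\delta| < \Upsilon} \left\{ \| \boldsymbol{\psi}_{N,\delta} \|_N^2
+ 2 \| \boldsymbol{\psi}_{N,\delta} \|_N \max\left( \sqrt{G^2_{N,\Delta-\delta/\sqrt{N}}},\; \sqrt{\tilde{G}^2_{N,\Delta-\delta/\sqrt{N}}} \right) \right\}
\xrightarrow{P} 0,
\]

so that

\[
G^2_{N,\hat{\Delta}_{gmm}} - \tilde{G}^2_{N,\Delta-\delta_N/\sqrt{N}} \xrightarrow{P} 0,
\]

which is (10). □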
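
The minimizer and minimum value of the quadratic used in the proof can be verified by elementary calculus. The sketch below assumes $\mathbf{V}_N^{-}$ is symmetric and uses the shorthand $\mathbf{a} = (\mathbf{A}_{N,\Delta} - \mathbf{E}_N)/\sqrt{N}$ and $q(\delta)$, both introduced here only for illustration.

```latex
% Expand the quadratic in delta:
\[
q(\delta) = \| \mathbf{a} + \delta \boldsymbol{\eta} \|_N^2
= N \left\{ \mathbf{a}^T \mathbf{V}_N^{-} \mathbf{a}
+ 2 \delta\, \boldsymbol{\eta}^T \mathbf{V}_N^{-} \mathbf{a}
+ \delta^2\, \boldsymbol{\eta}^T \mathbf{V}_N^{-} \boldsymbol{\eta} \right\}.
\]
% Set the derivative to zero and substitute back:
\[
q'(\delta) = 0 \iff
\delta = - \frac{\boldsymbol{\eta}^T \mathbf{V}_N^{-} \mathbf{a}}{\boldsymbol{\eta}^T \mathbf{V}_N^{-} \boldsymbol{\eta}},
\qquad
\min_{\delta} q(\delta)
= N \left\{ \mathbf{a}^T \mathbf{V}_N^{-} \mathbf{a}
- \frac{\left( \boldsymbol{\eta}^T \mathbf{V}_N^{-} \mathbf{a} \right)^2}{\boldsymbol{\eta}^T \mathbf{V}_N^{-} \boldsymbol{\eta}} \right\}
= G^2_{N,\Delta}
- \frac{\left\{ \left( \mathbf{A}_{N,\Delta} - \mathbf{E}_N \right)^T \mathbf{V}_N^{-} \boldsymbol{\eta} \right\}^2}{\boldsymbol{\eta}^T \mathbf{V}_N^{-} \boldsymbol{\eta}}.
\]
```

The subtracted term is the squared standardized group rank statistic with weights proportional to $\mathbf{V}_N^{-} \boldsymbol{\eta}$, which is how the difference of GMM statistics in (10) connects to the best group rank statistic.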